ABSTRACT

Motivation: Current advances in DNA synthesis, cloning and sequen- 
cing technologies afford high-throughput implementation of artificial 
sequences into living cells. However, flexible computational tools for 
multi-objective sequence design are lacking, limiting the potential of 
these technologies. 

Results: We developed DNA-Tailor (D-Tailor), a fully extendable 
software framework, for property-based design of synthetic DNA 
sequences. D-Tailor permits the seamless integration of multiple 
sequence analysis tools into a generic Monte Carlo simulation that
evolves sequences toward any combination of rationally defined properties.
As proof of principle, we show that D-Tailor is capable of
designing sequence libraries comprising all possible combinations 
among three different sequence properties influencing translation 
efficiency in Escherichia coli. The capacity to design artificial se- 
quences that systematically sample any given parameter space 
should support the implementation of more rigorous experimental 
designs. 

Availability: Source code is available for download at
https://sourceforge.net/projects/dtailor/

Contact: aparkin@lbl.gov or cambray.guillaume@gmail.com 
Supplementary information: Supplementary data are available at 
Bioinformatics online (D-Tailor Tutorial). 
1 INTRODUCTION

The accumulation of genomic data has fueled the development of 
numerous computational tools that infer functional behavior 
from biological sequences. These algorithms essentially capture 
our understanding of how functional information is encoded in 
nucleic acid and protein sequences. As a result, molecular biolo- 
gists can now access a plethora of sequence analysis tools to help 
them predict functional behaviors from plain sequences (Altschul
et al., 1997; Bailey et al., 2009; Giardine et al., 2005; Hofacker,
2003; Kingsford et al., 2007; Markham and Zuker, 2008;
Thomas-Chollier et al., 2011). Common tasks comprise the identification
of sequence motifs from nucleic acid (DNA/RNA) or
protein sequences (e.g. promoter or termination activity, recom- 
bination or splicing sites), as well as the computation of sequence 
properties that are mechanistically linked to particular
phenotypes (e.g. codon usage or propensity to form transmembrane
protein domains).

Such sequence-analysis tools are usually used to inform biological
discovery in natural genomic sequences. However, considering
recent advances in DNA technologies and the concomitant
rise of synthetic biology applications (Cambray et al., 2011; Carr
and Church, 2009; Czar et al., 2009; Endy, 2005; Ma et al., 2012),
these same tools may also be leveraged to guide the design of 
artificial sequences satisfying predefined functions of interest. 

Ideally, elementary biological functions should be contained
within well-defined sequence parts that could be re-used with
acceptable reliability in different contexts [e.g. Davis et al. (2011)
and Mutalik et al. (2013)]. However, it is becoming increasingly
clear that many molecular behaviors result from the combined
influence of several sequence determinants that cannot be neatly
encapsulated within the physical boundaries of a single part, but
rather emerge at the interface between the different parts
(Cambray et al., 2013; Kosuri et al., 2013; Mutalik et al., 2012;
Salis et al., 2009). In this context, the multidimensional examination
of DNA sequences becomes necessary to better capture the
inherent complexity of biological behavior and further enable
predictive design of synthetic sequence functions and activities
[e.g. Allert et al. (2010), Dvir et al. (2013), Kinney et al. (2010),
Na et al. (2013), Rhodius and Mutalik (2010), Rodrigo et al.
(2012), Salis et al. (2009), Seelig et al. (2006), Welch et al. (2009)].

Valuable sequence-design tools implementing heuristic
searches have been successfully developed for multi-objective
optimization within specific applications [e.g. protein synthesis
optimization (Chung and Lee, 2012; Dana and Tuller, 2012;
Gaspar et al., 2012, 2013; Raab et al., 2010; Racle et al., 2012;
Salis et al., 2009; Welch et al., 2011)]. However, application of
such optimization procedures requires an objective function
relating computed sequence properties to an expected performance
score. Unfortunately, the data and models required to describe
these relationships are generally not sufficient to support
truly reliable functional design.

Interestingly, sequence-design tools can also be used upstream 
of the optimization process to produce libraries of sequences that 
are more suited for the development of predictive models. 
Although large-scale studies have mostly used random
approaches to introduce variability in the synthetic sequences to
be interrogated (Dvir et al., 2013; Quan et al., 2011), similar endeavours
have greatly benefited from following well-established
design of experiments (DoE) (Allert et al., 2010; Antony, 2003;
Kosuri et al., 2013; Sharon et al., 2012; Smith et al., 2013).


Fig. 1. D-Tailor enables multidimensional analysis and design of DNA sequences. D-Tailor provides a flexible and extendable architecture to interrogate 
different sequence properties (box in the middle). The left panel depicts an example of the retrieval process of two properties (RNA structure and motif 
prediction) from multiple input sequences that can come from either FASTA or CSV files. The resulting score profile can be used to identify general 
trends and further define ideal parameter ranges for the design objectives. The right panel shows the design mode of D-Tailor, wherein a seed sequence is 
evolved to meet a user-defined combination of sequence properties. The figure depicts a full-factorial design for two different properties of interest (RNA 
structure and motif scores) with three levels each (low, medium and high), which yields a total of nine different combinations (colored areas). 



DoE is a general framework that fully integrates planning and 
analysis phases, and comprises three major steps. The first one 
consists in identifying the factors of interest and defining the 
range of values for each factor. In the case of molecular sequences,
factors are properties of the primary sequence itself
and can be typically identified by reanalysing available functional
genomic data and published mechanistic studies. The second
step consists in implementing a particular experimental design 
wherein multiple combinations of factor levels are selected to 
create an experimental dataset providing maximal information 
to relate the design factors to the response variable(s). For 
example, one of the most informative DoE is the full-factorial
design, where all possible combinations of factor levels across
the different factors are tested. The resulting dataset not
only permits estimation of the contribution of each factor to the
measured response variable, but also robustly captures the interactions
between the different factors (Antony, 2003; Mutalik
et al., 2013). Last, the third step includes the collection of experimental
data and definition of a model relating the multiple factors
to the response variable(s). Of note, this can be an iterative
process wherein models derived from the third phase can inform 
the design of a new set of experiments. 
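
As a minimal illustration of the full-factorial design described above, the following Python sketch enumerates every combination of factor levels. The factor names and level labels are hypothetical placeholders chosen for this example; they are not D-Tailor identifiers.

```python
from itertools import product

def full_factorial(levels_per_factor):
    """Return every combination of levels, taking one level per factor."""
    names = list(levels_per_factor)
    return [dict(zip(names, combo))
            for combo in product(*(levels_per_factor[n] for n in names))]

# Two hypothetical factors with three ordinal levels each.
design = full_factorial({
    "rna_structure": ["low", "medium", "high"],
    "motif_score": ["low", "medium", "high"],
})
print(len(design))  # 9 combinations (3 x 3)
```

The number of targets is the product of the number of levels of each factor, which is why full-factorial designs grow quickly as factors or levels are added.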

Although implementation of experimental designs systematic- 
ally varying easily manipulated factors can be straightforward 
(e.g. growth medium, pH, temperature or oxygen levels), the 
ability to design artificial sequences whose intrinsic properties 
can be systematically varied is not necessarily trivial (e.g. binding 
site affinity or the strength of an RNA secondary structure). 

Here, we present D-Tailor, an extendable framework supporting
integration of multiple sequence analysis tools to mine and
design biological sequences. D-Tailor uses a heuristic search
algorithm to enable flexible design of synthetic sequences varying
multiple properties of interest so as to satisfy complex DoE.
We have validated our tool by successfully designing artificial 
sequence libraries conforming to full-factorial designs, which rep- 
resent the upper bound of experimental design complexity. More 
specifically, we have designed libraries systematically varying 
multiple sequence properties known to impact translation 
efficiency in E.coli. To further demonstrate the versatility of 
the algorithm, we also used D-Tailor to design artificial bacterial 
promoter sequences varying multiple cis-regulatory properties
(see Supplementary Material). 

2 METHODS 

D-Tailor essentially implements the two-step planning process outlined 
above (Fig. 1). The analysis mode computes property scores from plain 
biological sequences. Here, the user specifies input sequences and a pre- 
defined set of properties to be computed. The design mode integrates the 
analysis routines with a parameterizable Monte Carlo algorithm that 
evolves an input sequence (seed) so as to match the specified combin- 
ations of property scores. In a typical workflow, users can use the analysis 
mode to identify sequence properties and operational ranges that seem 
worth exploring in design mode. 

2.1 Sequence analyzer 

D-Tailor provides a generic interface for multidimensional interrogation 
of DNA sequences. The software is designed with a modular architecture, 
so that users with basic programing skills can easily implement or extend 
modules for handling any sequence property of interest. Such modules 
can be implemented using custom Python code or scripts connecting to 
third-party software (see the Tutorial available in the Supplementary
Material). In analysis mode, D-Tailor reads a set of sequences in either 
delimiter separated (e.g. CSV) or FASTA format files. A property profile 
is then computed for each of the input sequences by successively calling 
the analysis modules specified by the user (Fig. 1, left panel). 
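
The analysis workflow just described (read sequences, then call each evaluator in turn) can be sketched in a few lines of Python. This is an illustrative stand-in rather than D-Tailor's own code: the function names `read_fasta`, `gc_content` and `profile` are assumptions made for this example, and the two evaluators are deliberately trivial.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dictionary."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records

def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def profile(records, evaluators):
    """Apply every evaluator to every sequence and collect the scores."""
    return {name: {label: fn(seq) for label, fn in evaluators.items()}
            for name, seq in records.items()}

fasta = ">s001\nATGC\nGGCC\n>s002\nATATATAT\n"
table = profile(read_fasta(fasta), {"gc": gc_content, "length": len})
print(table["s001"])  # {'gc': 0.75, 'length': 8}
```

Keeping evaluators as plain callables keyed by a label mirrors the modular idea in the text: adding a property means adding one entry to the dictionary.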

D-Tailor currently comprises 14 different modules to compute various 
sequence properties involved in diverse mechanisms of gene regulation. 
This collection of sequence property evaluators includes algorithms to 
score promoter regions or transcription factor binding sites based on 
sequence logos (Thomas-Chollier et al., 2011), estimate translation initiation
rates based on the Shine-Dalgarno (SD) sequence (Shine and
Dalgarno, 1975), predict propensity to form RNA structures, calculate 
nucleotide composition or compute the codon adaptation index (CAI) for 
a given gene sequence (Sharp and Li, 1987). Although the implementation 
of the different sequence property evaluators is usually self-contained 
within D-Tailor, the computation of specific properties may rely on 
third-party software [e.g. UNAfold (Markham and Zuker, 2008) for
the prediction of RNA secondary structure]. Together, these modules 
illustrate diverse implementation modalities and provide useful examples 
to guide future extensions (see Supplementary Material). The specifica- 
tion of adequate analysis routines is an essential prerequisite to running 
the design mode. 
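
To make one of these evaluators concrete, the sketch below computes a codon adaptation index as the geometric mean of per-codon relative adaptiveness values, following the general definition of Sharp and Li (1987). The weight table is a made-up toy covering four codons, and the function name is an assumption for this example; D-Tailor's own CAI module may differ in its details.

```python
import math

def codon_adaptation_index(cds, weights):
    """Geometric mean of relative adaptiveness values over the codons of cds.

    weights maps codon -> w in (0, 1]; codons absent from the table are skipped.
    """
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))

# Hypothetical weights, for illustration only (not real E. coli values).
toy_weights = {"CTG": 1.0, "CTA": 0.25, "AAA": 1.0, "AAG": 0.25}
print(codon_adaptation_index("CTGAAA", toy_weights))  # 1.0
```

Working in log space avoids numerical underflow when many small weights are multiplied over a long gene.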

2.2 Sequence designer 

As capacities for DNA synthesis increase exponentially, the ability to
computationally design artificial sequences needs to become more automated
and transparent. The most innovative feature of D-Tailor is to
provide a generic solution for designing synthetic sequences constrained by
multiple properties of interest (Fig. 1, right panel).

The design process in D-Tailor starts with the specification of a seed 
sequence and the desired design objective (i.e. the DoE) (Fig. 1, right 
panel). Seed sequences serve as a template to bootstrap the evolutionary 
design process. Typically, users would use a particular sequence of inter- 
est from which they want to derive a mutational series. The DoE enu- 
merates combinations of sequence properties that need to be generated, 
each of which constitutes a design target. D-Tailor provides a flexible 
scheme for the definition of DoE, which can vary from full-factorial to 
entirely customized designs. 

The definition of a finite number of targets requires the discretization of
continuous property scores into a finite number of nominal or ordinal
levels. For example, Figure 1 shows the discretization of two sequence
property scores into three ordinal levels (low, medium and high). This
framework markedly differs from usual multi-objective optimization
approaches (Chung and Lee, 2012; Raab et al., 2010; Racle et al.,
2012), which operate to optimize a single continuous and integrated performance
score rather than explicitly target different regions of the
parameter space. As illustrated in Section 3, natural feature profiles
extracted from available genomic sequences can be used to guide the 
discretization processes and ensure biological relevance of the sampled 
space. For each sequence property, users may define as many levels as 
necessary to attain the desired degree of resolution in the designed se- 
quences. However, since the number of possible combinations increases 
geometrically with the number of properties/levels, their definition must 
be mindful of downstream experimental capacities. 
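
The discretization step can be sketched as follows, assuming equal-count (quantile) bins derived from a reference set of scores. The helper names and the synthetic score list are assumptions for this example only.

```python
from bisect import bisect_right

def quantile_boundaries(scores, n_levels):
    """Inner cut points that split the sorted scores into n_levels equal-count bins."""
    ordered = sorted(scores)
    return [ordered[(len(ordered) * k) // n_levels] for k in range(1, n_levels)]

def to_level(score, boundaries):
    """Ordinal level (1-based) of a score, given ascending cut points."""
    return bisect_right(boundaries, score) + 1

scores = list(range(100))                 # stand-in for genome-wide property scores
cuts = quantile_boundaries(scores, 5)
print(cuts)                               # [20, 40, 60, 80]
print(to_level(19, cuts), to_level(20, cuts), to_level(99, cuts))  # 1 2 5
```

With three properties and five levels each, the number of combinations is 5 ** 3 = 125, which illustrates how quickly the number of targets grows.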

Finding a sequence that conforms to an arbitrary combination of 
property levels is often computationally infeasible using a brute force 
approach. Indeed, the sequence space to be searched is gigantic
($4^L$, where $L$ is the number of nucleotides in the sequence to be designed;
more if indels are allowed). To optimize the search process, D-Tailor uses
a Monte Carlo algorithm to evolve a given seed sequence towards the set 
of design targets (Fig. 2). 

More specifically, the algorithm loops through cycles of evolution until 
all target combinations of property levels specified by the DoE are found. 
Each cycle consists in three consecutive steps: (i) a target combination of 






Fig. 2. The sequence designer algorithm comprises three steps,
described in the main text. Initially, a target combination of features is
selected and then a sequence that is close (i.e. short Euclidean distance) to
the desired target is chosen to serve as the template in the sequence evolution
step. This last step applies successive mutations until it finds a
sequence matching the target combination of features.



property levels is randomly selected; (ii) a template sequence is chosen 
from the repository of previously generated sequences using fitness pro- 
portionate selection (only seed sequences are available at the very first 
iteration); and (iii) a predefined number of mutational iterations are per- 
formed until a sequence satisfying the target combination of the property 
level is found (Fig. 2, sequence evolver). We use the inverse of the
cumulative Euclidean distance (D) between property levels as a generic
fitness measure of a sequence relative to a given design target
[Equation (1)]:

$$D = \sqrt{\sum_{i=1}^{n} (d_i - t_i)^2} \qquad (1)$$

where $n$ represents the number of sequence properties, and $d_i$ and $t_i$ represent
the levels of the $i$-th sequence property in the designed sequence and the
desired combination, respectively.
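
The distance measure and the fitness-proportionate choice of a template can be sketched as below. This is an illustration under stated assumptions, not D-Tailor's implementation: the function names are hypothetical, and returning an exact match immediately (to avoid dividing by a zero distance) is a choice made for this example.

```python
import math
import random

def distance(levels, target):
    """Euclidean distance D between two tuples of property levels."""
    return math.sqrt(sum((d - t) ** 2 for d, t in zip(levels, target)))

def pick_template(candidates, target, rng=random):
    """Roulette-wheel choice among (sequence, levels) pairs, weighted by 1 / D."""
    dists = [distance(levels, target) for _, levels in candidates]
    for (seq, _), d in zip(candidates, dists):
        if d == 0:                      # assumption: an exact match is used directly
            return seq
    weights = [1.0 / d for d in dists]
    sequences = [seq for seq, _ in candidates]
    return rng.choices(sequences, weights=weights, k=1)[0]

print(distance((3, 4), (0, 0)))         # 5.0
```

Sequences closer to the target receive proportionally larger weights, but distant sequences can still be drawn, which keeps some diversity among templates.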

Each iteration of the sequence evolver also comprises three steps: (i) the 
sequence being evolved is analyzed and a property requiring optimization 
(i.e. not within the target level) is randomly selected; (ii) the template se- 
quence is then mutated following user-specified mutational rules (see 
below); and (iii) the feature scores of the resulting sequence are analyzed 
and evaluated with respect to the current design target [Equation (1)]. 
Every generated sequence is also screened for compliance to a user-defined 
set of rules meant to prevent the emergence of undesired properties in the 
final designed sequences (e.g. restriction sites, unexpected promoters or 
terminators). Only validated sequences are stored in the database. 

Next, if the new sequence matches the target combination (D = 0), then 
the target is marked as completed and the evolution cycle is terminated. 
Otherwise, the algorithm updates the template for the next mutational 
iteration, choosing between retaining the current template sequence or 
accepting the mutant just derived. At this point, we defined three different 
selective regimes: (i) directional selection, where the sequence with the 
lower Euclidean distance to the target combination is chosen; (ii) neutral 
selection, where either of the two sequences is selected with predefined
probabilities; or (iii) temperature selection, as inspired by simulated 
annealing optimization (Kirkpatrick et al., 1983), where the sequence is 
selected based on a temperature schedule that allows worse sequences 
(longer distances) to also be selected with a probability that decreases 
with the number of iterations performed. 
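
The three selective regimes can be summarized as one acceptance rule. The sketch below is hedged: the temperature branch uses a standard Metropolis-style criterion with a geometric cooling schedule, which is an assumption borrowed from textbook simulated annealing and not necessarily the schedule used in D-Tailor. All parameter names are hypothetical.

```python
import math
import random

def accept_mutant(d_template, d_mutant, regime, iteration=0,
                  p_neutral=0.5, t0=1.0, cooling=0.99, rng=random):
    """Return True if the mutant should replace the current template."""
    if regime == "directional":
        return d_mutant < d_template            # keep whichever is closer to the target
    if regime == "neutral":
        return rng.random() < p_neutral         # fixed probability, distance ignored
    if regime == "temperature":
        if d_mutant <= d_template:
            return True
        # Worse mutants are accepted with a probability that shrinks as the
        # temperature decays with the iteration count.
        temperature = max(t0 * cooling ** iteration, 1e-12)
        return rng.random() < math.exp(-(d_mutant - d_template) / temperature)
    raise ValueError(f"unknown regime: {regime}")
```

For example, `accept_mutant(2.0, 1.0, "directional")` returns True, whereas the same call with the distances swapped returns False.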

At each of the mutational iterations, new sequences can be generated 
through random mutation of the template sequence, as usual in many 
sequence optimization tools (Chung and Lee, 2012; Gaspar et al., 2012;
Salis et al., 2009). In addition, D-Tailor offers the possibility to imple- 
ment specialized mutation operators that aim at improving the likelihood 
to generate desired property changes. Practically, a mutation operator 
randomly selects a property amongst those that are non-optimal in the
current template ($d_i - t_i \neq 0$). We then distinguish between: (i) targeted
operators, which restrict the mutational space to specific regions of the 
sequence that are therefore more likely to affect the property that needs to 
be evolved; and (ii) oriented operators, which further specify particular 
mutation patterns to bias the production of variants toward the current 
design target. For example, if the design goal specifies an increase in the 
CAI of a gene, the targeted mutation operator restricts the mutable 
region to the coding sequence and randomly replaces a codon by another 
one irrespective of its usage score. The oriented mutation operator further 
constrains the replacement of a randomly chosen codon with one asso- 
ciated with a higher usage score, thereby enforcing the required increase. 
For certain emergent features, the definition of oriented mutation might 
not be so straightforward. For example, we implemented oriented muta- 
tion operators for RNA secondary structure by specifically targeting 
mutations to bases that are predicted to be paired or unpaired, to, respect- 
ively, decrease or increase the strength of the mutated RNA structure. 
Importantly, any mutation operator targeting gene-coding sequences can 
be further constrained to only generate synonymous mutations, thereby 
preserving the encoded protein sequence while modifying the underlying 
DNA properties. 

In some applications, it may be desirable to limit the overall divergence 
between sequences in the designed library, so that it provides small vari- 
ations with respect to a particular reference sequence. Conversely, users 
might want to generate sequences that are as dissimilar as possible and, 
therefore, share as few confounding factors as possible. In D-Tailor, users 
can manipulate mutational patterns and the selective regime — two major 
parameters of the evolutionary design process — to indirectly control se- 
quence diversity, and consequently impact the rate of sequence evolution, 
as well as the overall performance of the search algorithm (see below). 

3 RESULTS AND VALIDATION 

D-Tailor provides an integrated Python-scripting framework for
multidimensional analysis of sequence properties and for the 
design of artificial sequences constrained by multiple sequence 
properties of interest. 

As a case study, we have chosen three different previously 
reported sequence determinants of translation efficiency. In 
E.coli, two major factors have been shown to modulate the 
rate of translation initiation: (i) the strength and position of a 
SD motif upstream of the start codon (Barrick et al., 1994; Shine
and Dalgarno, 1975); and (ii) the propensity of these sequence 
signals to engage in mRNA secondary structures (de Smit and 
van Duin, 1994; Hall et al., 1982; Kudla et al., 2009). Subsequent
to initiation, the rate of elongation may also affect the overall 
translation efficiency and is mainly determined by the codon 
usage of the gene (Gustafsson et al., 2004; Ikemura 1985; 
Kane, 1995; Sharp and Li, 1987; Welch et al., 2009, 2011). We
first illustrate how the D-Tailor analysis module can be used to
examine such sequence properties in the natural genome of
E.coli. Then, we demonstrate how to use the D-Tailor design
module to generate artificial sequence libraries systematically
varying the three properties of interest according to a full- 
factorial DoE. 

3.1 Using D-Tailor to interrogate sequences 

We used D-Tailor to re-analyze three different sequence proper- 
ties across the entire E.coli W3110 genome (Fig. 3). 
Mechanistically, the SD motif stabilizes the initial binding of 
the 30S subunit of the ribosome by establishing canonical base







Fig. 3. (A-C) Distribution of the three different sequence properties 
(hybridization energy between the 16S rRNA and SD sequence (A), min- 
imum folding energy of RNA structure in the translation initiation region 
(B) and CAI of gene sequences (C)) influencing translation efficiency in 
E.coli. The dashed lines indicate the quintile boundaries for the scores of 
each property, which were later used in design mode to discretize the 
parameter space. (D-F) Scatter plots showing the cross-correlation 
between the three sequence properties of interest 



pairing with the 3'-end of the 16S rRNA (anti-SD) (Shine and 
Dalgarno, 1975). We applied a sequence property evaluator that 
calculates the strength of the SD sequence by searching for a 
subsequence within the 25 nucleotides upstream of a start 
codon with highest affinity to the known anti-SD (Lithwick 
and Margalit, 2003). The presence of secondary structures in 
this region of the mRNA can hinder initiation by occluding 
the SD motif or the nearby start codon from recognition by 
the ribosomal subunits. For that purpose, we used an RNA- 
structure evaluator to compute the minimum free energy of the 
60 nucleotides subsequence centered on the start codon (Kudla 
et al., 2009). Finally, we used a CAI calculator to score the codon 
usage of a gene sequence (Sharp and Li, 1987). Practically, the 
usage of these property evaluators and associated parameters 
requires a standard interface, which is provided by extending 
the abstract class Feature in D-Tailor (see Supplementary 
Material). 

The sequence property profiles resulting from a genome ana- 
lysis give a solid basis to identify trends in the properties of 
interest, and to further determine the relevant parameter space 
to explore during the design step (Fig. 3A-C). Correlations 
amongst property scores may also provide insights into potential
functional interactions, although some may be purely incidental.
For example, the modest correlation between RNA structure in 
the translation initiation region and the affinity between ribo- 
somes and the SD sequence (Fig. 3D) might merely reflect the 
thermodynamic propensity of G-rich SD motifs to engage in 
secondary structures. Similarly, the peculiar shape of the relationship
between CAI and RNA secondary structure (Fig. 3E)
might stem from the joint contributions of independent evolutionary
pressures acting on these two properties
to tune expression levels [highly expressed genes being both
under selection for high CAI and for low structure (Gu et al.,
2010; Kudla et al., 2009; Plotkin and Kudla, 2011; Tuller et al.,
2010)]. It is then up to the user to define a DoE containing
combinations of sequence property scores that are best suited
to test the research hypothesis to be investigated.

3.2 Using D-Tailor to implement experimental design on 
sequence properties 

Although recent advances in DNA synthesis, cloning and 
sequencing make it possible to generate and experimentally 
probe thousands of custom DNA/RNA sequences (Dvir et al.,
2013; Kosuri et al., 2013; Patwardhan et al., 2009, 2012; Quan
et al., 2011; Sharon et al., 2012; Smith et al., 2013), the availability
of computational tools to aid the rational design of large
sequence libraries remains very limited. 

The main purpose of D-Tailor is to provide a flexible compu- 
tational tool to design custom sequences satisfying complex 
specifications. Such a task can be extremely laborious when the
properties of interest physically overlap in the sequence space. 
For instance, in our case study, the subsequence containing the 
SD motif influences the formation of RNA secondary structures 
in that same region. Likewise, the secondary structure can be 
affected when modifying codon usage at the beginning of the 
gene. Typically, such optimization problems are best solved 
using a trial-and-error approach wherein sequence variants are 
generated using random mutations until a desired combination 
of property scores is found (Allert et al., 2010; Gaspar et al.,
2013; Raab et al., 2010; Racle et al., 2012; Salis et al., 2009).
To generalize this process, the design mode of D-Tailor provides 
a framework to integrate any sequence property evaluator into a 
parameterizable Monte Carlo algorithm that iteratively evolves 
sequences toward a specific set of design targets (or combinations 
of property levels). 

We used D-Tailor to design sequences that systematically 
vary the three properties of interest (or factors) defined above 
(Fig. 3). For each of these factors, we defined five contiguous 
ordinal levels on the basis of the quintiles observed in the natural 
genome (Fig. 3A-C, dashed lines). We then instructed D-Tailor 
to search for sequences conforming to a full-factorial DoE based 
on these levels. This DoE describes a total of 125 design targets 
corresponding to all combinations of five levels across the three 
different properties ($5^3$). To validate our approach, we compared
the performance of four increasingly complex evolutionary strategies
available in D-Tailor at deriving full-factorial libraries for
30 different genes randomly selected in E.coli (Fig. 4A and B).
In these simulations, the algorithm was run for at most 3000
generations (with a single mutational event per generation),
allowing for unrestricted mutations in the 5' UTR but only for
synonymous mutations in the coding sequence.
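
The overall search can be illustrated with a toy Monte Carlo loop. Everything in this sketch is an assumption made for illustration: the two properties (G/C count and purine count, each binned into three levels) are stand-ins for real evaluators, the template is simply the closest stored sequence rather than a fitness-proportionate draw, and selection is directional with ties accepted. It is not D-Tailor's code.

```python
import math
import random
from itertools import product

def levels(seq):
    """Two toy properties binned into ordinal levels 1 to 3."""
    gc = sum(base in "GC" for base in seq)
    purines = sum(base in "AG" for base in seq)
    bin_width = len(seq) // 3 + 1
    return (gc // bin_width + 1, purines // bin_width + 1)

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mutate(seq, rng):
    """Replace one randomly chosen base with a different base."""
    i = rng.randrange(len(seq))
    new_base = rng.choice([b for b in "ACGT" if b != seq[i]])
    return seq[:i] + new_base + seq[i + 1:]

def design(seed, targets, max_generated=3000, iterations_per_cycle=50, rng=None):
    rng = rng or random.Random()
    pending = set(targets)
    repository = {levels(seed): seed}    # one stored sequence per visited combination
    found = {}
    if levels(seed) in pending:
        found[levels(seed)] = seed
        pending.discard(levels(seed))
    generated = 0
    while pending and generated < max_generated:
        target = rng.choice(sorted(pending))
        template = min(repository.values(),
                       key=lambda s: distance(levels(s), target))
        for _ in range(iterations_per_cycle):
            if generated >= max_generated:
                break
            mutant = mutate(template, rng)
            generated += 1
            lv = levels(mutant)
            repository.setdefault(lv, mutant)
            if lv in pending:            # may also hit a target other than the current one
                found[lv] = mutant
                pending.discard(lv)
            if lv == target:
                break
            # Directional selection; ties are accepted so the walk can drift.
            if distance(lv, target) <= distance(levels(template), target):
                template = mutant
    return found, generated

targets = list(product((1, 2, 3), repeat=2))     # 3 x 3 full-factorial design
found, generated = design("ACGT" * 5, targets, rng=random.Random(0))
print(f"{len(found)} of {len(targets)} targets found after {generated} sequences")
```

Dividing `generated` by `len(found)` gives a figure analogous to the generated-sequences-per-target measure used to compare strategies, although the numbers from this toy are not comparable to those reported for D-Tailor.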

We first explored the most rudimentary evolutionary strategy 
available in D-Tailor, random sampling, which does not imple- 
ment any heuristic search and simply generates random se- 
quences until all desired design objectives are completed. Every 
attempt to complete the full-factorial design before the threshold 
of 3000 generated sequences failed (Fig. 4A and B, black line, 
54.2 generated sequences per target found [gspt] on average). The 
second design strategy used D-Tailor's generic heuristic algo- 
rithm (Fig. 2 and Section 2) along with the simplest mutational 
method wherein new sequence variants are generated by random 
mutagenesis (Fig. 4A, yellow line). This strategy improved the 
efficiency of the search algorithm by a factor of 2 as compared to 







Fig. 4. D-Tailor design simulations. (A) We performed simulations of 
full-factorial design using 30 different initial sequences (seeds) and four 
different design strategies: random sampling (black) and heuristic search 
using random (yellow), targeted (blue) and oriented (orange) mutations. 
The different lines represent the average number (across 30 simulations) 
of target combinations found (out of 125) as a function of the number of 
generated sequences (up to 3000) for the four different strategies. We 
observed sizeable variation between seeds (not shown for clarity, see 
Supplementary Material for details). (B) Number of generated sequences 
per target found (gspt) for the four different mutational strategies 
(n = 30). (C) We used the same 30 different seeds to find six different 
target combinations at various Euclidean distances. The different lines 
show the average Hamming distance between the seed and the sequence
matching the target combination as a function of the Euclidean distance 
to the target combination using neutral (light blue), directional (orange) 
or temperature selection (black). (D) The number of generated sequences 
until the desired target is found as a function of the Euclidean distance to 
the target combination using either neutral (light blue), directional 
(orange) or temperature (black) selection 



that of the random sampling method (24.8 versus 54.2 gspt on
average, Mann-Whitney test P-value = 2.3 x 10~^^, Fig. 4B).
Still, many sequences had to be generated to meet the various 
design objectives. The third mutational strategy employed spatially
targeted mutation operators (see Section 2) and improved
the search algorithm efficiency by another factor of 2 (13.3 gspt
on average, Fig. 4B). The fourth strategy used more 'rational'
mutation operators that explicitly orient mutations toward the
desired objective (see Section 2) and provided slightly faster dynamics
(Fig. 4A, orange line, 11.8 versus 13.3 gspt on average,
Mann-Whitney test P-value = 0.129, Fig. 4B). Since the computational
time necessary to achieve a given set of design targets is
dependent on the number of generated sequences per target, 
these results illustrate the advantage of defining specific mutation 
operators whenever it is possible. 

When designing synthetic sequences, users may want to limit 
the divergence of the designed sequences with respect to the ini- 
tial seed. To roughly control the spread of the generated se- 
quences during the evolutionary process, users can manipulate 






J.C.Guimaraes et al. 



the strength of selection toward the desired target(s). To better 
illustrate this point, we evolved each of the 30 seeds previously 
selected toward six different target combinations bearing differ- 
ent Euclidean distances from the seeds (Fig. 4C and D). We then 
examined the behavior and results of the algorithm in response 
to three contrasted selective regimes: neutral, directional and 
temperature selection (Section 2). 

As expected, we observed that a more relaxed selection process 
(neutral) is able to generate sequences matching the desired 
target that are more similar to the seed sequence than those re- 
sulting from the directional or temperature selection approach 
(average Hamming distance of 21 versus 31.3 and 39.2, respectively; 
Mann-Whitney test P-value = 0.0005 and 1.03e-13; 
Fig. 4C). Nonetheless, the limitation of sequence diversity 
comes at the cost of longer computation time (Fig. 4D). 
In fact, for the 30 seed sequences, the neutral selection process 
requires the generation of eight and six times more sequences per 
target than the directional and temperature selection approach, 
respectively and on average. For large designs, users may have to 
balance the desired divergence of the designed sequences with the 
available computational power. A hybrid approach, wherein the 
algorithm is initially set with weak selection and hard constraints 
to limit divergence, and then progressively configured with 
increased selection bias and/or relaxed mutational constraints 
(e.g. allow non-synonymous mutations in coding sequences if 
acceptable to the user) as the rate of target discovery slows 
down may then be recommended. The details of such a procedure 
are likely specific for each application, and therefore we have not 
sought to implement an automatic schedule to control this 
behavior. Since the state of a D-Tailor design mode run is 
permanently stored in a database, we suggest that users experiment 
manually with adjusting these parameters. 
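
The divergence measure reported above, the Hamming distance between a designed sequence and its seed, is straightforward to compute when monitoring such manual adjustments. The following is a sketch only; the function names are hypothetical and it assumes that designed sequences keep the length of the seed.

```python
def hamming_distance(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))

def mean_divergence(seed, designs):
    """Average Hamming distance between a seed and its designed variants."""
    return sum(hamming_distance(seed, d) for d in designs) / len(designs)
```

A user could, for instance, track `mean_divergence` over the sequences found so far and tighten or relax the selection settings accordingly.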

4 CONCLUSION 

Advances in DNA-reading/writing technologies readily enable 
the construction and validation of complex genetic systems 
(Gibson et al., 2010). However, rules to successfully design syn- 
thetic sequences to functional specifications have been limited by 
measurements from biased natural samples and/or small-scale 
controlled synthetic samples comprising at most hundreds of sequences 
[e.g. Allert et al. (2010); Amit et al. (2011); Barrick et al. 
(1994); Garcia et al. (2012); Mutalik et al. (2012); Na et al. 
(2013); Rhodius and Mutalik (2010); Rodrigo et al. (2012); 
Salis et al. (2009)]. This lack of knowledge strongly restrains 
the practical applications of ab initio design. Innovative experi- 
mental methodologies based on high-throughput technologies 
are scaling the characterization process up to tens of thousands 
of designed sequence variants, providing larger datasets to better 
understand sequence/activity relationships (Dvir et al., 2013; 
Kinney et al., 2010; Patwardhan et al., 2009, 2012; Sharon 
et al., 2012; Smith et al., 2013). However dramatic, this increase 
in throughput remains limited in comparison to the sheer im- 
mensity of the sequence space. It is therefore crucial to reduce the 
dimensionality of the design space to a set of sequence properties 
of interest that can be independently varied to facilitate estima- 
tion of their individual contribution to the measured phenotype 
and further support predictable design of synthetic variants 
(Allert et al., 2010; Sharon et al., 2012; Smith et al., 2013). 



We developed D-Tailor as an extendable and flexible software 
platform for the multi-objective design of artificial sequences. 
It provides a generic interface to integrate multiple sequence 
analysis tools into a heuristic Monte Carlo search procedure 
capable of evolving sequences towards pre-defined design targets 
(Fig. 1). D-Tailor differs significantly from other multi-objective 
sequence optimization tools (Allert et al., 2010; Chung 
and Lee, 2012; Dana and Tuller, 2012; Gaspar et al.; Raab 
et al., 2010; Racle et al., 2012; Salis et al., 2009). First, it allows 
the definition of multiple design targets as combinations of sequence 
properties that embody a particular DoE. A DoE can 
range anywhere from one specific combination of property 
levels to a full-factorial design, where the parameter space is 
fully explored. In contrast, traditional optimization tools de- 
scribe design objectives in terms of desired response perform- 
ances, which are linked to the sequence properties by a 
complex and pre-defined static objective function. Such formalization 
is suited for functional optimization, but does not explicitly 
support systematic exploration of the parameter space. Second, 
D-Tailor provides an evolutionary algorithm to optimize both 
coding and non-coding regions. Third, D-Tailor supports the 
implementation of advanced mutational strategies that can sig- 
nificantly enhance the heuristic search performance (Fig. 4B). 
Finally, our tool is not application-specific and provides an 
open source solution based on an extendable architecture, such 
that new sequence property evaluators can be easily implemented 
and integrated into the sequence design engine. 
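
To make the notion of design targets concrete, the sketch below enumerates the targets of a full-factorial design as every combination of property levels. It illustrates the concept only and is not D-Tailor's interface; the function name, the property names and the number of levels in the usage note are hypothetical.

```python
from itertools import product

def full_factorial(levels_per_property):
    """Enumerate every combination of levels across the given properties.

    `levels_per_property` maps a property name to the list of levels it can take.
    Each returned target is a dict mapping property name to one chosen level.
    """
    names = list(levels_per_property)
    level_lists = [levels_per_property[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*level_lists)]
```

For three properties with five levels each, for example `full_factorial({"prop_a": range(1, 6), "prop_b": range(1, 6), "prop_c": range(1, 6)})`, this yields 125 targets. A design with a single target is the special case in which each property has exactly one level.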

We demonstrate that D-Tailor can efficiently design artificial 
sequences to systematically vary any given set of properties 
of interest. To this end, we successfully derived full-factorial 
sequence libraries, starting from 30 different seed sequences, 
exploring the entire parameter space of three intertwined sequence 
properties affecting translation efficiency. Interestingly, 
we observed that the dynamics of target discovery varies slightly 
depending on the input seed (see Supplementary Material for 
details). This illustrates that different sequences may have 
distinct evolutionary landscapes, some being more amenable than 
others to generating widely variable profiles of property scores 
with fewer mutational cycles (Cambray and Mazel, 2008; 
Wagner, 2008). For both targeted and oriented mutational methods, 
the average dynamics of target discovery revealed a relatively 
steady rate for the first ~80% of targets, followed by a 
sharp decrease in efficiency, presumably because the remaining 
targets specify combinations of property levels that are harder to 
attain (Fig. 4A, orange and light blue lines). We also confirmed 
that more simplistic design approaches, such as generation of 
random sequences, perform poorly in comparison to a heuristic 
search (Fig. 4A and B). 

In addition to the case study detailed here, we have used 
D-Tailor to systematically design synthetic bacterial promoter 
sequences varying multiple cis-regulatory properties (see 
Tutorial in Supplementary Material for details), thereby 
demonstrating the generality and flexibility of our methods 
and tool. 

D-Tailor permits the implementation of advanced experimental 
designs into artificial sequence samples that can serve as a basis 
to rigorously and consistently test sets of molecular hypotheses. 
We believe that comprehensive full-factorial libraries of sequences 
are needed to investigate complex biochemical activities and 
robustly dissect the contribution of individual factors as well as 
their interactions. Such libraries will aid in characterizing complex 
multifactorial phenotypes and eventually deriving quantitative 
relationships between sequence and activity. 